1. INTRODUCTION 

The number of liver disease patients has increased significantly over the last few years, 
resulting in many deaths every year. Liver disease is one of the leading causes of mortality worldwide 
and comprises a wide range of diseases with varied or unknown etiologies. For instance, one study reports that in 
2017, 1.32 million deaths worldwide, or 2 to 4% of all annual deaths, were caused directly by cirrhosis [1]-[3]. 
Automated decision-making methods based on machine learning models can help reduce the number of deaths 
caused by liver disease. 

Cirrhosis is a condition in which damage to the liver leaves it unfit for proper functioning. 
Designing and developing a system for the prediction and diagnosis of liver disease can help doctors and health 
experts detect liver disease correctly [4]. The main purpose of a liver disease prediction system is to 
classify a given observation as a liver disease patient or not, based on the symptoms of liver disease 
used in training a classification algorithm. 

Machine learning (ML) is a branch of artificial intelligence that plays a major role in healthcare in 
the diagnosis of diseases [5]. An automated liver disease diagnosis system can play an important role in 
reducing the mortality rate due to liver disease. ML techniques improve the decision-making process by 
reducing the false positive rate and increasing the true positive rate during liver disease identification. 

The major problem in liver disease diagnosis using ML techniques is improving the accuracy of the ML 
algorithms employed [6]. Improved accuracy yields better diagnosis, reducing the false negatives and 
ultimately increasing the precision of liver disease diagnosis. In disease diagnosis using ML techniques, 
clinical liver disease symptoms are usually used to identify patterns in the dataset that map to the class label. 
The patient undergoes further medical tests if the ML model identifies a pattern matching the positive class 
label. However, not all the symptoms presented in the liver disease dataset are important to the ML learning 
process. Hence, identifying the features that better characterize liver disease is paramount in developing a 
more accurate liver disease diagnosis and prediction model [7]. This study is therefore devoted to answering 
the following questions: 

a. What are the risk factors for liver disease? 

b. Which liver disease features have higher importance to the learning process of the support 
vector machine? 

c. How can the predictive accuracy of the support vector machine be improved? 

The rest of this work is organized as follows: section 2 presents related work on liver disease diagnosis 
by using various machine learning algorithms. Section 3 discusses the method and dataset used for simulation and 
experimentation. Section 4 presents the results achieved, and finally, section 5 concludes the work. 


2. RELATED WORK 

Several studies have been conducted to predict liver disorders using various machine learning 
algorithms [8]. In one such study, a framework for liver disease prediction was developed using a clinical 
liver disease dataset. The framework was implemented using hybrid feature selection and regression analysis, 
and the model achieved 89.21% for liver disease diagnosis. 

Hashem and Mabrouk [9] developed a decision tree model for predicting the early stages of cirrhosis 
in patients. The study also compares the decision tree model with the random forest (RF) model, and the 
simulation result shows that the random forest model achieves a higher accuracy of 70.67% than the decision 
tree model. A support vector machine (SVM) algorithm-based liver disease diagnosis model is proposed in [10]. 
The SVM is trained on a clinical liver disease dataset collected from the University of California Irvine 
(UCI) machine learning repository. The experiment shows promising results in liver disease prediction, with 
the SVM model achieving a prediction accuracy of 73.2% on the UCI data. 

In addition, Afrin et al. [11] conducted a comparative study on the prediction of liver disease using 
SVM and an adaptive boosting algorithm. Both models are trained using 583 samples of the liver disease 
dataset collected from the University of California Irvine (UCI) repository. The simulation result shows 
that adaptive boosting outperforms the SVM model with a prediction accuracy of 74.65%. Similarly, in [12], 
[13], a comparative study is conducted to analyze the performance of K-nearest neighbor (KNN), random forest, 
decision tree, and adaptive boosting algorithms using the UCI liver disease dataset. The result shows that 
the decision tree model outperforms KNN, random forest, and adaptive boosting for liver disease prediction. 

Geethaet and Arunachalam [14] conducted a comparative study on four machine learning algorithms, 
namely random forest, logistic regression, artificial neural network (ANN), and Naive Bayes (NB). Accuracy 
is used as the criterion for comparing the performance of the four algorithms. The result shows that random 
forest outperforms logistic regression, ANN, and NB with the highest accuracy of 84.29%. A comparative study 
on logistic regression and SVM for liver disease prediction shows that SVM outperforms logistic regression, 
achieving a liver disease prediction accuracy of 75.04%. 

The studies in [15] and [16] applied KNN and NB to develop a liver disease prediction model. The KNN 
model achieved a prediction accuracy of 72.5%, while the NB model achieved 63.19%. Hence, the KNN model 
outperforms the NB model for liver disease prediction. 

Other studies [17]-[19] show that feature selection is important for improving the performance of the 
machine learning model for liver disease diagnosis. With feature selection, the overlapping symptoms of a 
disease used in the training of the machine learning process can be interleaved. Hence, based on the literature, 
the researchers decided to apply recursive feature elimination for determining the significant features for 
improved performance. Moreover, SVM is employed for model training because different studies [20]-[25] 
show that the SVM model is effective in multi-class classification tasks such as liver disease prediction. 

As the literature survey in this section shows, previous studies do not consider the effect of class 
imbalance on the predictive accuracy. Although the prior works achieve good accuracy, accuracy alone does 
not reveal the class-wise performance of an ML model on liver disease diagnosis. Moreover, the features 
that contribute most to the learning process of the different machine learning algorithms are not presented 
in the literature. Thus, this work aims to address this gap, i.e., to investigate the liver disease diagnosis 
features that are relevant to the learning process of the support vector machine, and to balance the dataset 
using the synthetic minority oversampling technique (SMOTE). 
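
The core idea of SMOTE can be illustrated with a minimal sketch: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. The function name `smote_sketch`, the toy data, and the parameter values below are assumptions made for illustration only; they are not the implementation or the data used in this work.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: 8 majority samples and 3 minority samples
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 1.0], [11.0, 2.0], [12.0, 1.5]]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is an interpolation between two existing minority samples, the oversampled class stays inside the region already occupied by the minority class.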


3. METHOD 

To select the optimal feature subset for improved performance in the diagnosis of liver disease, the 
steps demonstrated in Figure 1 are followed. The proposed model uses the random forest to select suitable 
feature attributes of liver disease by calculating the feature importance and ranking the features 
accordingly. The selected optimal feature subset is then given as input to the SVM, which is trained on the 
feature subset selected by the random forest feature ranking method. 



Figure 1. SVM and RF-based hybrid model for liver disease detection 
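
The two-stage procedure described above (rank features with a random forest, then train the SVM on the top-ranked subset) could be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn, a synthetic dataset from `make_classification` in place of the liver disease data, and a hypothetical cut-off `top_k = 6`; none of these choices are taken from this work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the liver disease data (assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stage 1: rank the features by random forest importance (highest first)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Keep the top-ranked features; top_k is a hypothetical choice
top_k = 6
selected = ranking[:top_k]

# Stage 2: train the SVM only on the selected feature subset
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train[:, selected], y_train)
acc = 100 * svm.score(X_test[:, selected], y_test)
print("selected feature indices:", selected.tolist())
print("test accuracy (%):", round(acc, 2))
```

Standardizing the inputs before the SVM is a common choice because the RBF kernel is sensitive to feature scale; the random forest ranking itself does not need scaling.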


3.1. Liver disease dataset analysis 

The liver disease dataset contains the data collected from the Mayo Clinic trial in primary biliary 
cirrhosis (PBC) of the liver, conducted over ten years between 1974 and 1984. A total of 424 PBC patients, 
referred to Mayo Clinic during that ten-year interval, met the eligibility criteria for the randomized 
placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the dataset participated in the 
randomized trial and contain largely complete data. The additional 112 cases did not participate in the 
clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of 
those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases 
as well as the 312 randomized participants. The proposed SVM-based liver disease diagnosis model is trained 
on the liver disease features shown in Table 1. 


Table 1. The variables in the liver disease dataset 
Variables of liver disease dataset 

1. Patient-ID 
2. Number of days between registration and the earlier of death, transplantation, or study analysis 
3. Status: status of the patient, C=censored, CL=censored due to liver transplantation, or D=death 
4. Drug: type of drug, D-penicillamine or placebo 
5. Age: age of the patient in days 
6. Sex: M=male or F=female 
7. Ascites: presence of ascites, N=No or Y=Yes 
8. Hepatomegaly: presence of hepatomegaly, N=not present or Y=present 
9. Spiders: presence of spiders, N=no spiders or Y=spiders present 
10. Edema: presence of edema, N=no edema and no diuretic therapy for edema, S=edema present 
    without diuretics or edema resolved by diuretics, or Y=edema despite diuretic therapy 
11. Bilirubin: serum bilirubin in mg/dl 
12. Cholesterol: serum cholesterol in mg/dl 
13. Albumin: albumin in gm/dl 
14. Copper: urine copper in ug/day 
15. Alk Phos: alkaline phosphatase in U/liter 
16. SGOT: SGOT in U/ml 
17. Triglycerides: triglycerides in mg/dl 
18. Platelets: platelets per cubic ml/1000 
19. Prothrombin: prothrombin time in seconds 
20. Stage: histologic stage of disease (1, 2, 3, or 4) 


To explore the effect of feature selection on the SVM model, the authors first trained the model on 
the original input features and then applied RFECV feature selection to select the optimal features. After 
that, the model is trained on the optimal feature subset returned by the RFECV feature selection method. To 
quantify the effect of feature selection, accuracy is used for measuring the performance. The accuracy of 
the SVM is determined using the formula given in (1), 


\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{1} \]


where TP (true positive) is the number of observations predicted as liver disease patients that belong to 
the patient class, TN (true negative) is the number of observations predicted as liver disease negative that 
belong to the no-disease class, FP (false positive) is the number of observations predicted as liver disease 
patients that belong to the liver disease negative class, and FN (false negative) is the number of 
observations predicted as liver disease negative that belong to the liver disease patient class. 
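
As a worked example of formula (1), the sketch below counts TP, TN, FP, and FN from a small set of labels and computes the accuracy. The label vectors are hypothetical and chosen only to make the arithmetic easy to follow; the per-class rates at the end are an added illustration of class-wise performance and are not part of formula (1).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy in percent, following formula (1)."""
    return 100 * (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 1 = liver disease, 0 = no liver disease
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn)  # 3 5 1 1
print(acc)             # 80.0

# Class-wise view: fraction of each class that is predicted correctly
true_positive_rate = tp / (tp + fn)  # 3 / 4
true_negative_rate = tn / (tn + fp)  # 5 / 6
```

Here 8 of the 10 observations are classified correctly, so the accuracy is 80%, while the two per-class rates differ (0.75 for the disease class and about 0.83 for the no-disease class).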


4. RESULTS AND DISCUSSION 

This section presents the features that are important for training the SVM model, selected using 
random forest-based feature elimination. The effect of irrelevant features and the variation in accuracy 
between the optimal features and the original input features are analyzed. 


4.1. Feature importance analysis 

Figure 2 shows the features that are relevant to identifying the true positive instances of liver 
disease. Features such as aspartate aminotransferase, alkaline phosphatase, total bilirubin, direct 
bilirubin, albumin, and age have a significant effect on the model output. In contrast, the albumin and 
globulin ratio, total proteins, and gender have a lower impact on the model output. Thus, using the features 
with a higher impact on model output improves the performance of the SVM model for liver disease prediction. 


Figure 2. SHAP summary plot of liver disease features (plot title: SHAP Feature Importance for test set) 


Figure 3 demonstrates the permutation feature importance using random forest-based feature 
elimination. As demonstrated in Figure 3, the features selected by recursive feature elimination using the 
random forest model are aspartate aminotransferase, alkaline phosphatase, alanine aminotransferase, age, 
total bilirubin, albumin, and direct bilirubin. The SVM model is therefore trained on these features, and 
the resulting performance is demonstrated in Table 1. Recursive feature elimination with cross-validation 
(RFECV) using the random forest is employed to determine the optimal number of input features. As 
demonstrated in Figure 3, the RFECV returns 8 features as the optimal input features, producing the highest 
accuracy of 77.5%. Based on the number of input features determined by the RFECV, the 8 most important 
features are selected according to the permutation-based feature importance shown in Figure 2. The model is 
then trained on these 8 optimal input features to obtain higher accuracy. 
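
The RFECV step described above can be sketched with scikit-learn's `RFECV` class wrapped around a random forest. The synthetic data, the number of trees, and the 5-fold stratified split are assumptions made for illustration; the settings used in this work are not stated in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the liver disease features (assumption)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Recursively drop the least important feature and score each subset
# size by cross-validated accuracy
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_.tolist())

# Reduced feature matrix that would be passed on to the SVM
X_selected = selector.transform(X)
```

The boolean mask in `support_` marks the retained columns, so the same mask can be applied to a held-out test set before evaluating the SVM.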
Figure 3. Permutation feature importance using random forest (plot title: RFECV for RandomForestClassifier) 


5. CONCLUSION 

This work proposed an automated liver disease detection system using a random forest model. The 
developed system is tested on a real-world liver disease dataset. Simulation using the test set shows that 
the developed system achieves acceptable accuracy. With such an automated system, the required precision can 
be achieved because the system aids the decision-making process of liver disease diagnosis. The simulation 
shows that the developed liver disease diagnosis system has 78.3% accuracy. The result also shows that the 
accuracy improves by 10.2% when RFECV is used for feature selection together with synthetic minority 
oversampling. Thus, the developed model can assist the medical expert in the prediction of liver disease. 